Perception Encoder Integration #2478


Open · wants to merge 12 commits into main

Conversation

@berniebear commented Apr 25, 2025

Add Perception Encoder to timm.

Intro

This PR aims to integrate Perception Encoder (paper, code) from FAIR into timm. We thank you for the support and feedback.

Perception Encoder Performance:

Vision-Language Benchmarks

| Model | Checkpoint | IN-1k | IN-v2 | IN-A | ObjectNet | COCO-T2I | Kinetics-400 | VTT-T2I |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| B/16 224px | vit_pe_core_base_patch16_224 | 78.4 | 71.7 | 62.4 | 71.9 | 50.9 | 65.6 | 47.6 |
| L/14 336px | vit_pe_core_large_patch14_336 | 83.5 | 77.9 | 89.0 | 84.7 | 57.1 | 73.4 | 50.3 |
| G/14 448px | vit_pe_core_gigantic_patch14_448 | 85.4 | 80.2 | 92.6 | 88.2 | 58.1 | 76.9 | 51.2 |

Multimodal LLM Benchmarks

| Encoder | Checkpoint | DocVQA | InfoQA | TextVQA | MVBench | PerceptionTest | EgoSchema |
| --- | --- | --- | --- | --- | --- | --- | --- |
| L/14 448px | vit_pe_lang_large_patch14_448 | 81.9 | 46.4 | 73.0 | 52.3 | 54.7 | 59.8 |
| G/14 448px | vit_pe_lang_gigantic_patch14_448 | 84.4 | 48.3 | 75.2 | 52.4 | 56.0 | 62.0 |

Vision-centric Benchmarks

| Encoder | Checkpoint | ADE20k (linear probe, 448px, w/o TTA) | LVIS (Mask R-CNN 1024px, box / mask mAP) | COCO (DETA 1824px, box mAP) |
| --- | --- | --- | --- | --- |
| G/14 448px | vit_pe_spatial_gigantic_patch14_448 | 49.3 | 54.2 / 49.3 | 66.0 |

Proposed integration and changes:

  1. Add the PE models in pe.py under timm/models/ (see the registration sketch below).
  2. Load the pe module in timm/models/__init__.py.
  3. Process the PE checkpoints on the HF hub (e.g. facebook/vit_pe_core_base_patch16_224_timm) into the safetensors format so they are loadable in timm (via push_to_hub, suggested by NielsRogge).
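
A minimal sketch of what items 1 and 2 look like on the registry side, so that timm.create_model() can resolve the new names. The stub class and hyper-parameters below are illustrative assumptions, not the actual contents of pe.py:

import torch.nn as nn
from timm.models import register_model


class PE(nn.Module):
    # Stand-in for the PE vision transformer that pe.py actually defines.
    def __init__(self, embed_dim: int = 768, num_classes: int = 0):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes) if num_classes else nn.Identity()

    def forward(self, x):
        return self.head(x)


@register_model
def vit_pe_core_base_patch16_224(pretrained: bool = False, **kwargs):
    # In the PR this would go through timm's pretrained-weight machinery
    # (build_model_with_cfg / HF hub download); only the registration is shown here.
    return PE(embed_dim=768, **kwargs)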

Known issues/limitations:

  1. PE's RoPE is not compatible with timm's existing layers; PE's RoPE implementation is kept in pe.py for now (a generic sketch of how rotary embeddings are applied follows this list).
  2. PE's vision transformer (the PE class) is customized to use both an absolute pos_emb and RoPE.
  3. The checkpoints currently live under facebook's HF hub org; they need to be copied to timm's HF hub, after which the explicit pretrained_cfg override below is no longer needed.
  4. Currently ViT only for timm; the text transformer is to be integrated into the open_clip repo later.
  5. For PE inference/fine-tuning only; no support yet for PE pre-training from scratch (e.g. no progressive resolution or MetaCLIP curation).
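
For context on item 1, a generic sketch of how rotary position embeddings are typically applied to attention queries/keys (the rotate-half formulation); this is illustrative only and is not the PE implementation in pe.py:

import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim into two halves and rotate each pair: (a, b) -> (-b, a).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(q: torch.Tensor, k: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor):
    # q, k: (batch, heads, tokens, head_dim); sin/cos broadcast over batch and heads.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin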

PE models and their HF hub paths

A. ViT only

  1. vit_pe_core_base_patch16_224: facebook/vit_pe_core_base_patch16_224_timm
  2. vit_pe_core_large_patch14_336: facebook/vit_pe_core_large_patch14_336_timm
  3. vit_pe_core_gigantic_patch14_448: facebook/vit_pe_core_gigantic_patch14_448_timm
  4. vit_pe_lang_large_patch14_448: facebook/vit_pe_lang_large_patch14_448_timm
  5. vit_pe_lang_gigantic_patch14_448: facebook/vit_pe_lang_gigantic_patch14_448_timm
  6. vit_pe_spatial_gigantic_patch14_448: facebook/vit_pe_spatial_gigantic_patch14_448_timm

B. CLIP (ViT + Text transformer. For future open_clip integration only)

  1. pe_core_base_patch16_224: facebook/pe_core_base_patch16_224_timm
  2. pe_core_large_patch14_336: facebook/pe_core_large_patch14_336_timm
  3. pe_core_gigantic_patch14_448: facebook/pe_core_gigantic_patch14_448_timm

Test plan (parity):

import torch
import os, sys
from PIL import Image
import timm

## timm model
model_timm = timm.create_model('vit_pe_core_large_patch14_336', pretrained=True, pretrained_cfg = {'hf_hub_id':'facebook/vit_pe_core_large_patch14_336_timm'})
model_timm = model_timm.cuda()

import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

## original pe model
model_pe = pe.VisionTransformer.from_config("PE-Core-L14-336", pretrained=True)  # Downloads from HF
model_pe = model_pe.cuda()

preprocess = transforms.get_image_transform(model_pe.image_size)
image = preprocess(Image.open("./apps/pe/docs/assets/cat.png")).unsqueeze(0).cuda()

feat_pe = model_pe(image).detach().cpu().numpy()
feat_timm = model_timm(image).detach().cpu().numpy()
print('feat_pe', feat_pe) # [[ 0.8944705   0.32723966 -0.83092093 ... -0.4582289  -0.76679176 -0.29771212]] 
print('feat_pe.shape', feat_pe.shape) # (1, 1024)
print('feat_timm', feat_timm) # [[ 0.8944705   0.32723966 -0.83092093 ... -0.4582289  -0.76679176 -0.29771212]] 
print('feat_timm.shape', feat_timm.shape) # (1, 1024)
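
A small numerical assertion one might append to the parity script above (the tolerances are arbitrary choices):

import numpy as np

# Fails loudly if the timm port drifts from the reference PE features.
np.testing.assert_allclose(feat_timm, feat_pe, rtol=1e-4, atol=1e-5)
print('parity check passed')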

All the models supported and tested:

model_timm = timm.create_model('vit_pe_core_base_patch16_224', pretrained=True, pretrained_cfg = {'hf_hub_id':'facebook/vit_pe_core_base_patch16_224_timm'})
model_timm = timm.create_model('vit_pe_core_large_patch14_336', pretrained=True, pretrained_cfg = {'hf_hub_id':'facebook/vit_pe_core_large_patch14_336_timm'})
model_timm = timm.create_model('vit_pe_core_gigantic_patch14_448', pretrained=True, pretrained_cfg = {'hf_hub_id':'facebook/vit_pe_core_gigantic_patch14_448_timm'})
model_timm = timm.create_model('vit_pe_lang_gigantic_patch14_448', pretrained=True, pretrained_cfg = {'hf_hub_id':'facebook/vit_pe_lang_gigantic_patch14_448_timm'})
model_timm = timm.create_model('vit_pe_lang_large_patch14_448', pretrained=True, pretrained_cfg = {'hf_hub_id':'facebook/vit_pe_lang_large_patch14_448_timm'})
model_timm = timm.create_model('vit_pe_spatial_gigantic_patch14_448', pretrained=True, pretrained_cfg = {'hf_hub_id':'facebook/vit_pe_spatial_gigantic_patch14_448_timm'})

Note:

  1. timm models starting with the vit prefix contain only ViT weights (e.g. vit_pe_core_large_patch14_336).
  2. The PE CLIP checkpoints are placeholders for open_clip integration after the timm integration (e.g. pe_core_gigantic_patch14_448).

Thanks for all the support and feedback for this timm integration!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

elif freqs_for == "lang":
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
elif freqs_for == "pixel":
freqs = torch.linspace(1.0, max_freq / 2, dim // 2) * pi


pi here should load from torch

Collaborator

also, prefer to keep torch.pi vs math.pi and not import x.pi as pi ...
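
Applied to the pixel branch quoted above, the suggested change is simply (sketch):

# Same line as above, using torch.pi instead of an imported/aliased pi:
freqs = torch.linspace(1.0, max_freq / 2, dim // 2) * torch.pi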

elif freqs_for == "constant":
freqs = torch.ones(num_freqs).float()
self.freqs = nn.Parameter(freqs, requires_grad=learned_freq)

Collaborator

The freqs here is a parameter that isn't in the original model, so there are complaints about it when loading the state dict... I assume the behaviour in the pretrained model still matches the current code? But for the option of having learned_freq, should this be...

        theta *= theta_rescale_factor ** (dim / (dim - 2))
        if freqs_for == "lang":
            freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
        elif freqs_for == "pixel":
            freqs = torch.linspace(1.0, max_freq / 2, dim // 2) * pi
        elif freqs_for == "constant":
            freqs = torch.ones(num_freqs).float()
        else:
            assert False
        if learned_freq:
            self.freqs = nn.Parameter(freqs)
        else:
            self.freqs = nn.Buffer(freqs, persistent=False)


attn_pooler_heads: int = 8,
pool_type: Literal["attn", "tok", "avg", "none"] = "attn",
num_classes: int = 0, # no use for PE
in_chans: int = 3,
Collaborator

I do need to add support for a classifier, either in the PE module or by wrapping everything, otherwise the default behaviour for adapting encoders as classifiers doesn't work so well ... I'll figure out how best to support it.

@berniebear (Author), Apr 30, 2025

Added classifier support (and reset) in the new commit. Current forward pass: [x -> Transformer(x)] -> [pool -> proj -> head (for classification)], split into forward_features and forward_head respectively. Let's discuss more in Slack (hf-fair-pe-collab). Thank you!
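
A rough sketch of the forward split described above; the attribute names, average pooling, and dimensions are assumptions for illustration, not the exact code in the commit:

import torch
import torch.nn as nn


class PEForwardSketch(nn.Module):
    # Illustrative only: mirrors the described x -> transformer -> pool -> proj -> head flow.
    def __init__(self, embed_dim: int = 1024, proj_dim: int = 1024, num_classes: int = 0):
        super().__init__()
        self.blocks = nn.Identity()  # stand-in for the transformer trunk
        self.proj = nn.Linear(embed_dim, proj_dim)
        self.head = nn.Linear(proj_dim, num_classes) if num_classes else nn.Identity()

    def forward_features(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x)  # token features (B, N, embed_dim)

    def forward_head(self, x: torch.Tensor) -> torch.Tensor:
        x = x.mean(dim=1)  # simple average pool stands in for PE's pooling here
        return self.head(self.proj(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.forward_head(self.forward_features(x))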


self.conv1 = nn.Conv2d(
in_channels=3,
out_channels=width,
Collaborator

3 should be in_chans
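
Sketched, the fix being asked for (kernel/stride values are assumptions, since the quoted snippet is truncated):

self.conv1 = nn.Conv2d(
    in_channels=in_chans,    # use the constructor argument instead of a hard-coded 3
    out_channels=width,
    kernel_size=patch_size,  # assumption: standard ViT patch embedding
    stride=patch_size,
    bias=False,
)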



class Rope2D(Module):
def __init__(self, dim, grid_size, use_cls_token=False):
Collaborator

This module should be marked non-traceable to pass the FX tests, as it looks like the if t.ndim == 3 check will break tracing.

See e.g.

@register_notrace_module # reason: FX can't symbolically trace torch.arange in forward method
for use of notrace decorator
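
A sketch of how that could look for this module (the import path shown is how current timm versions expose the decorator; treat it as an assumption):

import torch.nn as nn
from timm.models._features_fx import register_notrace_module


@register_notrace_module  # reason: the data-dependent t.ndim check can't be symbolically traced by FX
class Rope2D(nn.Module):
    def __init__(self, dim, grid_size, use_cls_token=False):
        super().__init__()
        # ... existing rope setup unchanged ...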

freq = torch.cat([freq, torch.zeros(1, freq.shape[-1])], dim=0)

self.freq = Parameter(freq[None, ...]) # remark: using Parameter instead of tensor for device consistency

Collaborator

Also a complaint about this parameter; was it originally not a parameter? It doesn't exist in the state dicts.

Author

Changed the rope freq to nn.Buffer(freqs, persistent=False). Thanks for the suggestion.

@rwightman (Collaborator)

@berniebear sorry, silly typo in my comments that wasn't in my working hacks, it's self.register_buffer not nn.Buffer, haha ...
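
So the corrected version of the earlier suggestion would be roughly:

if learned_freq:
    self.freqs = nn.Parameter(freqs)
else:
    # register_buffer keeps device/dtype movement consistent without adding
    # an entry to the state dict (persistent=False), matching the pretrained checkpoints.
    self.register_buffer('freqs', freqs, persistent=False)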
